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Abstract 

Inference problems with incomplete observations often aim at estimating popula- 
tion properties of unobserved quantities. One simple way to accomplish this estimation 
is to impute the unobserved quantities of interest at the individual level and then take 
an empirical average of the imputed values. We show that this simple imputation es- 
timator can provide partial protection against model misspeciflcation. We illustrate 
imputation estimators' robustness to model specification on three examples: mixture 
model-based clustering, estimation of genotype frequencies in population genetics, and 
estimation of Markovian evolutionary distances. In the final example, using a rep- 
resentative model misspeciflcation, we demonstrate that in non-degenerate cases, the 
imputation estimator dominates the plug-in estimate asymptotically. We conclude by 
outlining a Bayesian implementation of the imputation-based estimation. 

Key words: exponential family, imputation, incomplete observations, model misspecifl- 
cation, robustness 
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1 Introduction 



We are interested in robustness to model misspecification in problems with incomplete ob- 
servations. Semiparametric approaches have enjoyed a lot of success in this area b ut these 



methods lack universality and so need to be fine-tuned for each prob lem at hand 



2006 



Little and An 



2004 . 



Kang and Schaferl . 



Tsiatis 



2007 



Chen et al. 



20091 ] . Consequently, when 



practitioners are faced with nonstandard problems with incomplete observations, they are 
often left to their own devices. As a first step to ameliorating this deficiency, we propose a 
general imputation-based estimation method that provides partial protection against model 
misspecification for incomplete data problems. 

The idea of using imputation techniques to combat model misspecification is not new. 
Consider the standard missing data problem of estimating population mean [i given a sample 
(zx,yxzi, xi), . . . , (z n , y n Zn, x n ), where z/i is a response variable, Zi is a response indicator tak- 
ing value 1 if yi is observed and otherwise, and Xj is a vector of covariates. Assuming strong 
ignorability, meaning that y^ and %i are independent given x i; we use only those individuals 
for which the response variable is availa ble to fit a response model with rrii = E(?/j | to 



obtain rhi 



Rosenbaum and Rubin 



19831 ] . Intuitively, we can combine the empirical estimate 



of the mean of respondents with model-based predictions of missing yiS for non-respondents 
to arrive at a = (1/n) y^".._ 1 Vi + (1/n) Yli-z =o r ^ j - This estimator, called an imputation 
estimator by iTsiatis and Davidianl 20071 ]. will be biased if the response model is misspeci- 



fied. However, the bias vanishes as the number of non-respondents decreases t o zero. Using 



Tsiatis and Davidian 



2007] 's imputation 



conditioning on the observed data, we can rewrite 

esti mator as ji = {1/n) Y^ =1 E(?/i | Zi, yi%i, Xj). In a completely unrelated missing data set- 



ting, 



O'Brien et al. 



20091 ] also use expectations of complete data condit ional on the observed 



O'Brien et al. 



20091 



data to arrive at novel estimators of evolutionary distances. Although 
used imputation by conditional expectations explicitly, these authors did not recognize the 
full generality of their approach. 

In this paper, we investigate the behavior of imputation estimators when they are ap- 
plied to general problems with incomplete observations. After formulating the generalized 
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imputation estimator, we consider three problems with incomplete observations. We start 
with a mixture model and demonstrate that imputation is useful for estimating densities 
of mixture components. Moreover, this imputation density estimation improves accuracy of 
mixture model-based clustering. Next, we turn to a statistical genetics problem of estimating 
genotype frequencies. To keep the genetic-specific intricacies to a minimum, we construct 
an artificial but representative example. In spite of the introduced simplification, our re- 
sults are directly applicable to a topical problem of multilocus haplotype/genotype frequency 
estimatio n, where model m i sspec i fication occurs du e to a failure to account for population 



structure 



Allen and Satten 



2008 



Kraft et al. 



20051 ] . In our last example, we consider impu- 



tation estimators of evolutionary distances bet ween DNA sequences with partially observed 



continuous-time Markov chains introduced in 



O'Brien et al 



2009|. We fill some theoreti- 



cal gaps in their work. First, we identify situations where imputation estimators are not 



helpful. In doing so, we - for the first time to our knowledge - use the 
group -based Markov models belong to the regular exponential family 



act that so called 



Evans and Speed! . 



1993]. Next, we compute almost sure limits of imputation and plug-in estimators for a par- 



ticular model misspecification. Although we make several simplifying assumptions in this 
derivation, we believe that qualitat ively our results are portable to more realistic applications 



considered by 



O'Brien et al 



2009]. We conclude by outlining a Bayesian implementation of 



the imputation-based estimation. 



2 Generalized imputation estimators 

Assume that complete data x = (xi,...,x n ) are independent and identically distributed 
with each x, distributed according to a parametric family of sampling densities Pt( x ; Qt) 
with parameters 6? G &t- We observe each Xj through a transformed vector = y(xj). 
We further assume that the true sampling density pr(xi; Qt) is unknown to us and we have 
to erroneously postulate a misspecified model pj?(xi;0p), where Op G &f with parameter 
spaces ®t and 0^ of possibly different dimensions. Despite this model misspecification, we 
would like to estimate = E© T [s(xi)] = J s(xi)pr(xi; 0r)dxi, where s is an arbitrary mea- 
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surable function that maps complete data to an m- dimensional vector of summary statistics. 
Assuming that Op is identifiable from incomplete data y = (y y„), one can simply 
maximize the likelihood of the observed data niLiPHy*; ®f) to arrive at the maximum like- 
lihood estimate Op = argmaxg ge p.p(y; Then, ignoring model misspecification, we 
use Op to get the plug- in estimate of the complete-data summaries 



ffi = E dip [s(xi)] = J s(x 1 )p F (x 1 ;0 F )dx 1 . 



(1) 



This estimator is destined to be biased and asymptotically inconsistent in nearly all situations 
due to the model misspecification. 
Consider an imputation estimator 



- E o F t s ( Xi ) I yd = - / s (xi)PF& I y 4 ; o F )d^i. (2) 
n 71 „_i J 



n 

i=l »=i 



The motivation behind this new estimator is quite simple: in order to offer protection against 
model misspecification, we would like to use the empirical measure based on y 1; . . . , y n . To 
accomplish this, we write Eg T [s(xi)] = Eg T {Eg T [s(xi) | yi]} ~ P n Eg T [s(xx) | yi] where 
Pn/ — ~Yli=i f(yi) f° r an y measurable function /. In the absence of a good alternative, 
we plug-in dp for Op in the conditional expectations of s(xj) to arrive at our imputation 
estimator, 

If the family of distributions {pF(y; Op)} satisfies usual regularity conditions we have 
Op ^4-' #o- F° r example, if our model is not misspecified, i.e. &p = ®p, we would 
have = Op. Consider the family of functions J 7 which consists of conditional expec- 
tations: T = {/(yi;0) = Eg [s(xi) | yi)] , G @o} for some bounded open neighbor- 
hood ©o of the limiting value 0q. If we assume that T has finite bracketing number 
iVn(e, J 7 , L\(P)) for each e > and is pointwise continuous in 0, then one can show 
that PnEft s [xi) | y-i] ^4' E» T {Efl n [s( xi) |yi]} using standard empirical processes machin- 
ery van der Vaart and Wellnerl . 120001 ] . Assuming model misspecification almost inevitably 
leads to ^ Op. Therefore, our imputation estimator has little chance of achieving asymp- 
totic consistency. However, if the fraction of missing information is relatively small, our new 
estimator can be quite close to the true value both for finite sample sizes and asymptotically. 



Assume that a misspecified complete-data sampling density belongs to the regular expo- 
nential family so that Pf( x i> f ) = a(x!)exp [0^t(xi)l /b(6 F ), where t(xi) = (tx(xx), . . . , i r (xx)) 
is an r-dimensional vector of minimal sufficient statistics and 0f = (Ow , • • • , Oft) is a natural 



Sundberg 



1974 the likeli- 



parameter vector of the same dimension. Then, as noted by 
hood equations based on the observed data y can be written as (1/n) J2i=i ^6» F [t(xj) | y$] = 
E 0F [t(xi)]. Therefore, if the complete-data summary s(xi) can be expressed as a linear 
transformation of the sufficient statistics t(xi), imposed by the falsely assumed regular ex- 
ponential family model, then the plug-in estimator ([Q) and imputation estimator (|2j) coincide 
exactly regardless of the true sampling density of Xj.. 



3 Mixture models and model-based clustering 

Consider a mixture model with k components. Let h = (hi, . . . , h n ) be iid discrete random 
variables taking values in {1, ... , k) with probabilities Pr (h\ = j) = ctj, Y^j=i a i = 1- E ven t 
hi = j indicates that the observed y^ is sampled from the density p F j(y; @Fj)- The complete- 
data sampling density becomes 

k 

p F {hi,yi; F ) = JJ [ajPFj{yu Fj )] 1{ht=j} 
3=1 

We obtain parameter estimates a = (ax, . . . , a^) and 6 F = (9pi, • • • , 6 F k) by maximizing 
K=iPF{yuO F ), where p F { yi ; F ) = 

= i a jPFj{yi] 0Fj)- If we further assume regular ex- 
ponential family sampling densities of mixture components sharing the same normalizing 
constant a(y), p F j(y;p F ) = a(y) exp [tj(y) T F j] /bj(9 F j), then the density of the ith com- 
pletely observed sampling unit also belongs to the regular exponential family, 

{k k 
1 {h l =j} t j(yi) T °Fj + Y 1 {hi=3} 
3=1 j=l 

From our discussion of regular exponential family complete-data likelihoods, it is clear that 
plug-in and imputation estimators of mean complete-data summaries, 

E 0T [l{hi=j}tj{yi)] and Eg T [l{ hi =j}] , (3) 
5 



In 



will coincide exactly regardless of the true complete-data sampling model Pt{Yi, hi; Or)- 
In fact, plug- in and imputation estimators of the second mean complete-data summary, 
Eg T [l{hi=j}], will coincide even if densities PFj{Yi'i^F) do not belong to the regular ex- 
ponential family. To see this, note that the plug-in estimator in this context is a p ^ = 
Efi. [l{fci=j}] = P r (^-i = i) — OLj- The estimated probability that observation i belongs to 
component j is 



E (kh^j} I yi) 



52 i=1 ap F j(yi,0 Fj ) 



The imputation estimate of the jth mixing proportion becomes d* m = (l/n) Y^h=i ^ { l {hi=j} I Yi) 



f Mn) EILi 



that a? 1 



The likelihood equations for the mixture model can be rearranged to show 



Redner and Walker 



1984] . Notice that estimating all of the above complete- 



data expectations requires unambiguously identifying mixture component j, which we as- 
sume is possible by imposing constraints on mixture component parameters 9 fx, ■ ■ ■ , 0Fk- 

To make our discussion of mixture models more concrete, we simulate n = 1000 realiza- 
tions from a mixture of two log-normal distributions with the log-scale means fii = 1.5 and 
fi2 = 2.5 and standard deviations o\ = 0.2 and 02 = 0.25 respectively. The mixing propor- 
tion, a, was set to 0.3, completing the set of true model parameters 0t = (pi, ^2, &x, 02, a). 
Now, we assume a two-component normal mixture model with means z/ 1; v 2 i possibly un- 
equal standard deviations 8\, 62, and a mixing proportion 0. We estimate parameters 
Op = ( Vi ,v->.,6-\,5?., 0) of this misspecified model using m aximum likelihood via the EM algo- 



rithm [Dempster et al. 



1977 



Fraley and Rafteryl . 120031 ] . We show a histogram of simulated 



data with a normal mixture model fit in the left plot of Figure [TJ 

To avoid the label switching problem, we define mixture component labels by the inequal- 
ity ui < v 2 - Equation fl3]) says that if we try to estimate Eg T (l{/i 1= i}?/i), E 6t (l^^xyyT) 
or Eg T (l{7j 1= i}J, it does not matter whether we use the plug-in or imputation approach. 
Instead, we choose to estimate the proportion of samples from the first mixture component 
that fall to the right of some threshold c, /i(c) = Eg T (l{h 1= i}l{ yi>c }) . The plug-in estimate 
of this quantity is 



/i**( C ) = Eg F (l {hl = l} l {yi>c} ) 
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1 - $ 



c — V\ 
~lx~ 



(4) 
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Figure 1: Mixture model example. The upper left plot shows a histogram of 1000 simulated 
realizations of the two-component log- normal model, described in the text. The solid line depicts 
the normal mixture density estimated from these simulated data. The dashed vertical lines indicate 
four values of threshold c, for which we estimate /u(c) = Eg T (l{/ ll= i}l{j />c }) . Results of conventional 
and robust estimation of these quantities are shown in the upper right plot of the figure. We repeat 



simulation and estimation 1000 times and plot box-plots of relative errors, — ^"^^ and ^ , 

for c = 5.0, 5.5, 6.0, 6.5. The bottom row shows results of mixture component density estimation 
and classification errors during model based clustering. 
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A <m (c)-Mc) 



where $ is the standard normal cdf. Our imputation estimator becomes 

Y n 1 n 

" »=i ft »=i 

Since tails of mixture components can be estimated via imputation, it should be possible 
to devise an imputation estimator of mixture components' densities. Indeed, if we use a 
nonparametric kernel density estimator, where each observed point i is weighted by z^-, we 
arrive at an imputation estimate of the jth component density. This is potentially useful, 
because more acc urate estimation of compo nent densities may lead to more accurate model- 



based clustering [Fraley and Rafteryl . 12002 1. 



The right plot of Figure [TJ demonstrates results of estimating //(c) for threshold values 
c = 5.0, 5.5, 6.0, 6.5, depicted in the left plot of the figure by the dashed vertical lines. 
We consider these values of c, because they fall into the region where sampled points can 
not be easily assigned to either of the two mixture components. We simulate 1000 test 
data sets using already described settings. For each of the simulated data set, we compute 
plug-in estimates of //(c) using the fitted correct log-normal and the misspecified normal 
model and the imputation under the misspecified normal model. We show box plots of the 
corresponding relative errors in the upper right plot of Figure [TJ Although, the performance 
of the plug-in and imputation estimators under model misspecification is disappointingly 
similar, imputation density estimates, plotted in the bottom row, look more promising. We 
used plug-in density estimates under the correct and misspecified model and imputation 
density estimates to assign simulated points to two clusters. We then co mputed clustering 



classification error using R package MCLUST [Fraley and Rafteryl . |2003| . As shown in the 
lower bottom plot, clustering accuracy improves significantly under imputation estimates of 
mixture component densities and approaches the accuracy of clustering under the correct 
mixture model. 
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4 Estimating genotype frequencies 



Here, we turn to a classical problem in st atistical genetic s : esti mating allele and genotype 
frequencies from incomplete observations [Ceppelini et al.l . Il955j . Suppose that we measure 
some observable characteristic, called a phenotype, in n individuals and record them in a 
vector y = (yi, . . . , y n ), where each y^ takes one of M possible values in C — {c\, . . . , cm}. 
We further assume that each individual i has an unobserved genotype = (xn, x i2 ), defined 
as an unordered pair of gene variants, called alleles, on two paired chromosomes of this 
individual. Suppose there are R possible alleles, Q = (gi, ■ ■ ■ , Genotypes are assumed 
to determine observed phenotypes via a deterministic function h : Q x Q — > C such that 
h{gkiQi) — h(gi,gk). Making certain population genetics assumptions allows us to assume 
that unobserved genotypes are iid with 



PT((9k,9i);P,f) 



pI(i -f) + f Pk if k = i 



(5) 



where p = (pi, . . . ,pr) are population allele frequencies and / is called an inbreeding coef- 
ficient. We erroneously assume that / = 0, re ducing t he model for genotype probabilities 
to the celebrated Hardy- Weinberg equilibrium [Hardyi . Il908l . IWeinbergl . Il908( |. The falsely 
misspecified complete-data likelihood for datum 1 becomes 

R R 

PF (x i; p) = n (2^) 1{xi=(9 ^ )} H( Pi ) 2xi ^=^)} cc Y[pi\ 

k>l k=l k=l 

where ^ = 2x ^{xi=(g k ,g k )} + J2iLi l{xi=(sfc,sj)}- The misspecified observed-data likelihood for 
datum 1 is p F (yi, p) = E Xi: ft( Xl ) =2/a Pf(xi] p) 

Since the complete-data likelihood is in the regular exponential family with sufficient 
statistics (h, . . . the plug-in and imputation estimates of EQ^Lj CiU) will coincide ex- 
actly. Suppose our objective is to estimate genotype frequencies \iu = E (l{xi=( Sfc , S; )}) = 
Pr (xi = (gk,gi))- The complete-data summary l{xi=( 9fe , 9i )} can no ^ ^ e expressed as a linear 
combination of the sufficient statistics, so plug-in and imputation estimation will not nec- 
essarily produce identical results. After obtaining maximum likelihood estimates of allele 



Table 1: Mappings of complete to observed data during genotype frequencies estimation. 
Ambiguous phenotypes are highlighted in bold. 



(9k 


9i) 


(A, A) 


(A,B) 


(AC) 


(AD) 


(B,B) 


(B,C) 


(B,D) 


(C,C) 


(C,D) 


(D,D) 


h(g k 


9i) 


aa 


ab 


ac 


ad 


bb 


bdc 


bdc 


cc 


cd 


dd 


h 2 (g k 


9i) 


aa 


ab 


ac 


ad 


bd 


bdc 


bdc 


cc 


cd 


bd 



frequencies, p, the plug-in approach yields 

££ = Pl^{k=i} + 2p fe pjl {fe ^ } . 

The imputation estimator becomes 

~im / / \ | \ i njpl nj2p k pi 

Vkl = ~ Pr Xi = (9k, 9i) Vi = cj) 1{ w =cj} = 7 rrl{fc=«} + — L ~, -l{k^i}, 

where h(g k ,gi) = Cj and rij = Yn=i 1 {y l =c 3 }- 

Consider a particular case of the above model with four alleles: Q = {A, B, C, D}. Tabled] 
defines two mappings from genotypes to phenotypes, h\ : Q x Q — > C\ and hi : Q x Q — > C 2 , 
where C\ = {aa,ab,ac,ad,bb,bdc,cc,cd,dd} and C 2 = {aa,ab,ac,ad,bd,bdc,cc,cd}. Notice 
that C\ has 9 phenotypes and C 2 has 8 phenotypes. Therefore, the fraction of missing data is 
larger under mapping h 2 than under hi. We simulate 1000 observed phenotypes under both 
mappings using complete-data model (jSJ) with pa = 0.3, ps = 0.2, pc = 0.2, pp, = 0.3 nd 
/ = 0, 0.125, 0.25, 0.375, 0.5. For each of these 10 simulated data sets, we estimate allele 
frequencies pa, Pb, Pc, and pD using the EM algorithm and assuming that / = 0. 

For phenotypes that unambiguously correspond to exactly one genotype, the empiri- 
cal phenotype frequency can be used to estimate the corresponding genotype frequency. 
Therefore, it only makes sense to compare plug-in and imputation estimation for genotypes 
that correspond to ambiguously defined phenotypes. For example, under both hi and h 2 
genotypes (B, C) and (B, D) correspond to the phenotype bed. Suppose our goal is to esti- 
mate these genotype frequencies: \lbc = Pr ( x i = (B, C)) and hbd = Pr (xi = (B, D)). 
Plug-in estimates of these population-level quantities are 

fi^ c = Ipspc and p% D = 2p B p D . 
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8 Phenotypes, "BC" Frequency 
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Figure 2: Genotype frequency estimation. We plot box plots of relative errors, of plug- 
in and imputation estimates of genotype frequencies (fine an d /Abd) for two incomplete 
data mappings, with 9 and 8 observed phenotypes. Each pair of white and grey box plots 
corresponds to an inbreeding coefficient that ranges from to 0.5. 



Imputation estimates are obtained as 

n bc d PbPc 



pirn 
^BC 



n PbPc + PbPd 



and p™ D 



Kbcd 



PbPd 



n PbPc + PbPd 
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where ribcd = J2i=i ^-{yi=bcd} and n = 1000. Figure [2] shows box plots of relative errors of plug- 
in and imputation estimators, obtained by repeating the above simulation and estimation 
steps 1000 times. In the case of 9 phenotypes, corresponding to hi mapping, the imputation 
estimation offers remarkable protection against model misspecification. Decreasing the num- 
ber of observed phenotypes from 9 to 8 results in the imputation estimators outperforming 
the plug-in one only for / = 0.125 and / = 0.25. For the rest of inbreeding coefficient 
values, plug- in estimation produces better estimates of (jlbc, while imputation estimation 
offers better estimates of \ibd- However, overall imputation relative errors are still smaller 
than plug-in errors. 



5 Labeled evolutionary distances 



Imputation estimation was proposed by 



O'Brien et al. 



20091 ] in the context of estimation 



of evolutionary dista nces between mo 



evolutionary biology [Gu and Li 



1998 



ecular sequen ces, a standard problem in molecular 



Yangl, 



2006t | . Consider a 2 x n matrix y = {yij}, 



where each takes values in the S = {1, . . . , s}. We assume that all columns in y are 
independently generated by the same reversible and irreducible continuous-time Markov 
chain (CTMC) defined on the finite state-space S by infinitesimal generator A(6 F ). 

This Markov process models the evolution of DNA sequences so that the state space S usually 
consists of 4 nucleotide bases, however, a couple of alternative state-spaces are also often used. 
Each column in y is produced by first drawing yu from the stationary distribution of {X t }, 
7r(6 T ) = (iti(0t), ■ ■ ■ tt s (0t)), running the chain for an unknown time t and setting y 2 i = X t . 
For each realization i, we observe only the starting and ending states of the Markov chain 
on the time interval [0, £]. Here, model misspecification usually manifests itself through an 
incorrect parameterization of the infinitesimal generator, A(0p). The misspecified likelihood 
of the observed data is p F (y;0 F ) = YYl =1 ^ yii (0T)p yii y 2i (0 F ,t), where ~P(6 F ,t) = e A(0F) * = 
{Pij{^F,t)} and pij(0 F ,t) = Pr (X t = j \X = i) are finite-time transition probabilities of 
{X t }. Notice that transition probabilities depend on A and t only through their product. 
Therefore, we require the identifiability constraint t — 1. 
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In this example, complete data consist of the full Markov chain trajectory Xj = {X ri : 
< r < £}. A complete-data summary of scientific interest is s(xi) = Nc, the number of 
transitions of X t during the time interval [0, 1], labeled by the set of ordered state pairs C. 
In the absence of complete Markov trajectories, we are interested in the mean number of 
labeled transitions of the stationary Markov chain, available analytically via 



H = E 8t [s( Xi )] = E 0T (N c ) = 7r(0 T ) T A c (0 T )l, 



(6) 



where 1 is an s-dimensional column vectors of Is and Ac = {X uv x l{i u ,v)eC}}- in molecular 
evolution, this expected number of labeled Markov transitions translates into mean number of 



labeled mutations, al 
in a flexible manner 



owing evolutionary 



O'Brien et al. 



Diologists to measure molecular sequence similarity 



2009 ]. 



The plug-in approach for estimating \l proceeds by first fitting a possibly misspeci- 
fied Markov model, AiQp) and then using the resulting parameter estimates to compute 
complete-data summary expectations. More specifically, we obtain Op = argmax 0F p(y; 0p) 
and obtaining plug-in and imputation estimators 

1 n 

= 7r(e F ) T A(e F )l and fi m = - % [Af | X = y u , X, = y 2l ] . 



O'Brien et al. 



i=i 



20091 ] execute two extensive simulation studies that demonstrate that the 
imputation estimator offers remarkable protection against misspecification of a Markovian 
mutational model. 



5.1 Complete- data likelihood 

After falsely assuming a misspecified model parameterization A(0p) (and Tz(0p) as a re- 
sult) we condition on the initial Markov chain states and write the misspecified conditional 
complete-data likelihood 



Pf(X\ 



[0,1] i °F 



e F ) oc 



x e 



E« = i T U \ UU (0 F ) 



(7) 



where n uv is the number of times X t instantaneously jumped from u to v and T u is the total 
time X t spent in state u during the time interval [0, 1], both summed over all n realizations 
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of the Markov chain Guttorpl . Il995| . The complete-data likelihood belongs to the curved 
exponential family with sufficient statistics n = {n uv } u ^ v and T = (Ti, . . . ,T S ). 

Nearly all Markov infinitesimal generators used in molecular evolutionary biology fall into 
the set A = {A = {X uv } : X uv = n v a uv for u ^ v}, where a = {a uv } is a symmetric matrix. 
Such parameterization ensures reversibility of the Markov chain, a common assumption in 



the field of molecular evolution 



Yang 



2006 | 



5.2 Group-based models 

Notice that the likelihood ([7]) simplifies significantly if we assume a reversible model with 
equal diagonal entries of A: 

p F (X m ;cx)^l[a2 v+n -, (8) 

u<v 

because Ylu=i T u = 1 is the length of the observational time interval. It tur ns out that in 



mole cular evolution, only so called group-based models satisfy these properties [Evans and Speed 



1993| . Group-based Markov evolutionary models can be defined as continuous-time random 
walks on Abelian groups. If we define an Abelian group on a Markov chain state space S 
with algebraic operation "+" , then entries of the corresponding group-based CTMC gener- 
ator A must satisfy \ uv = g{u — v) for some function g : S — >■ [0, oo). For example, the 
most general group-based model on the state space of DNA bases {A, G, C, T} is a Kimura 
three-parameter model with 

\ 



A 



K3P , 



a, 



corresponding to the Klein group Z2 © Z2 



/_ 


a 


P 


7 


a 




7 


P 


P 


7 




a 




P 


a 





(9) 



Evans and Speed 



1993 [. 



Group-based models, co nstructed with algebraic symmetry in mind, fin d extensive use 



in statistical phylogenetics 



Sturmfels and Sullivant 



2005 



Steel et al 



1998) . For us, these 



models are appealing because they turn the completed-data CTMC likelihood into the regular 
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exponential family form. If we break all possible DNA mutations into three classes and define 
their corresponding counts N AGiCT = n AG + n GA + n CT + n T c, N AG)GT = n AC + n GA + n GT + 
Utg, and N A t,gc = n A T + riTA + n GG + n GG) then these counts form the sufficient statistics for 
the Kimura three-parameter model. From our discussion of the regular exponential family it 
follows that plug-in and imputation estimates of E a)J g )7 (ciN AGjGT + C2N AG)GT + c 3 N ATjGG ) 
will coincide exactly regardless of the tru e sampling model an d of the choice of constants ci, 



C2, and C3. This fact was not noticed by 



O'Brien et al. 



2009f | . because the authors did not 



consider group-based models explicitly in their work. 



5.3 A closer look at observed data likelihood equations 

Instead of invoking properties of the regular exponential family, one can find more general 
conditions under which imputation and plug-in estimates of labeled evolutionary distances 
coincide, as demonstrated by the theorem below. 

Theorem 1. Let y = {yij}, % = 1, 2, j = 1, . . . , n, be a pairwise sequence alignment generated 
by a CTMC with an unknown infinitesimal generator A(Ot) as described at the beginning of 
this section. We take A(6p) to be a misspecified model and Op = (Ofi, ■ ■ ■ , #Fr) to be the 
corresponding maximum likelihood estimator obtained from the observed data y. If 



A C {0 F ) - I x ir T (0 F )A c {O F )l e 



/dA{Oi 



\ de F1 



0A(0j 



p — Of 



00 



Fd 



Op — Of 



(10) 



where C C S 2 \ : % G S} is a set of ordered Markov state pairs and A c = {X uv x 

~L{(u,v)eC}}, then 

1 ™ 

E dF {Nc) = - J2 E 9 F {Nc I X = y u , X, = y 2i ) , 



8=1 



where Nc is the unobserved number of Markov chain transitions labeled by the set C. 
To illustrate the above theorem, consider a Kimura t wo-parameter model A K2P (a,/3) 



A K3P (a, /3, 0), obtained by setting 7 = (3 in matrix 



Kimural . 



1980j. For both of these 
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models, the stationary distribution is ir T = (0.25,0.25,0.25,0.25). Let 

A = {(A, G), (G, A), (C, T), (T, C)} and (11) 
C 2 = {(A, C), (C, A), (A, T), (T, A), (C, G), (G, C), (T, G), (G, T)} (12) 

be two mutational classes of interest. The partial derivatives of the Kimura two-parameter 
generator, 

^-A K2P (a,/3) = - [A Cl - 1 x tt*A £i 1] and 
^A K2P (a,/3) = ^[A £l -Ix 7 r t A £2 l], 

satisfy condition fflUj) . Therefore, Theorem 1 says that plug- in and imputation estimators 
of E olt p(Nc 1 ) and E a) p(Nc 2 ) coincide exactly. Of course this example reiterates the fact 
that complete-data likelihood of the Kimura two-parameter model belongs to the regular 
exponential family with sufficient statistics and Nc 2 . 



5.4 Misspecified Kimura model: asymptotic behavior 



Studying asymptotic properties of our imputation estimator is challenging in general even 
for the specific problem of the evolutionary distance estimation. Therefore, we turn to 
an elementary example to obtain some basic asymptotic results. First, we introduce the 
simplest group-based model on the nucleotide state space, known as a Jukes-Cantor model. 
The infinitesimal generator of this Markov c hain is obtained by setting a = j3 in the Kimura 
two-parameter model, A JC (7) = A K2P (7,7) 



Jukes and Cantorl . 



1969 [. 



Theorem 2. Assume that observed sequence data y was generated from the Kimura two- 
parameter model with generator A K2P (a, (5). Let 7 be the maximum likelihood estimate, 
obtained by fitting a Jukes-Cantor model with generator A. JC (^) to y. Then as the number 
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a = 0.01 



a = 0.1 





0.00 0.01 0.02 0.03 0.04 0.05 



Figure 3: Estimator limits in the Kimura model. In both panels of the figure we plot the true value 
of the mean number of labeled mutations fi = ~E(Nc) (solid line), the a.s. limits of the plug-in 
(dotted line) and imputation (dashed line) estimators. 

of columns in y, n, approaches infinity, 



1 / 1 + Op 2 ^-") 



?? 



-4 



i=i 



1/1 + 2^ 
H 4 1 3 



;L 4(e- 4 ^ + 2e- 2 ^)(e- 4/3 - e~ 2 ^) 



3(3 -e- 4 ^ -e-2(°+«) 



where L\ is defined by equation (Q2P- 



Corollary 1. Under the conditions of Theorem^ define 

1 - 

V = E a ,p (N Cl ) , fC = lim (N Cl ) , /C = lim - V (N Cl \ X = y u , X 1 = y 2l 



i=l 



JTien l/i^ 1 — /i| < — /i| u>/ien a ^ (3. In other words, the imputation estimator asymp- 
totically is always better than the plug-in one. 
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To illustrate the above theorem and its corollary we plot the true value (//) and a.s. 
limits of the plug-in (/i^) and imputation (/x^) estimators as function of (3 in Figure EJ We 
fix a = 0.01 in the left panel and a = 0.1 in the right panel. Roughly speaking, the left 
panel shows the behavior of the estimators when the overall mutation rate is low, while the 
right panel corresponds to a high mutation rate scenario. The lower the mutation rate, the 
better our imputation estimator behaves asymptotically. This property of the imputation 
estimation is expected, because low mutation rate translates into lower fraction of missing 
data, which in turn makes the imputation estimation more powerful. We have already seen 
this behavior of the imputation estimator in the previous examples. 



6 Bayesian implementation 



6.1 General recipe 

Although all examples so far were analyzed from the maximum likelihood perspective, one 
can easily perform imputation-based estimation in a Bayesian framework. To accomplish 
this, we first need to assign a prior distribution p(0p) to the parameters of our misspecified 
model pp{y;0p). We assume that it is possible to obtain either the posterior distribution 
Pf(0f | y) or the augmented posterior ppiOp-, x | y Y, possibly approx i matin g these distri- 
butions via Markov chain Monte Carlo (MCMC) Tanner and Wongl . Il987| |. Using these 
posterior distributions, we define plug-in and two imputation predictive distributions 



p(E^[s(xi)] |y) 



V 



n ^ 

i=i 



and 



V 



( 1 - 



As before, we hope that the latter two will provide us some protection against model mis- 
specification. These last two predictive distributions have the same mean, but conditioning 
reduces the variance of the third distribution. This is similar to Rao-Blackwellization in 
Monte Carlo sampling 



Casella and Robert 



1996]. but since we arc working under the as- 



sumption of model misspecification, smaller variance is not necessary a desirable property 
of a predictive distribution. 
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6.2 Bayesian estimation of genotype frequencies 

To illustrate the Bayesian implementation of our procedure, we revisit the genotype fre- 
quency estimation example. We generate 10 phenotype samples using the two genotype-to- 
phenotype mappings defined in Tableland setting the inbreeding coefficient / and true allele 
frequencies to the values we used in the original example. We place Dirichlet(l, 1, 1, 1) prior 
on allele frequencies and approximate the posterior distribution of complete data (genotype 
counts) and allele frequencies via Gibbs sampling. 

Recall that our goal is to estimate genotype frequency = Pr (x x = (gk,gi))- For 
{9k, gi) = (B,C) and (gk,gi) = (B,D), we report posterior distributions of 

1 rriki 1 rij2pkPi 

2j*w, -L 1 i-(^)} = — ^d -^^ = {g k ,g l )\y l ) = n{2pBpc + 2pBpD) - 

where m M = J27=i l{xi=(sfc,s0} and n i = Yn=i 1 {y 1 =BCD}- We report box plots of these pos- 
terior distributions in Figure HJ These box plots are not directly comparable to results in 
Figure |2} because our Bayesian analysis is based only on ten data sets, while the maximum 
analysis was done on 10,000 simulated data sets, one thousand for each value of / and for 
each genotype-to-phenotype mapping. To make these analyses comparable, one can study 
frequentist properties of Bayesian plug-in and imputation estimators based, for example, 
on the posterior median-based estimators of allele frequencies and genotype counts. Our 
Bayesian results are nonetheless consistent with the maximum likelihood analysis: imputa- 
tion estimators outperform the plug-in estimate in the case of nine phenotypes, none of the 
estimators have a uniform advantage across all values of the inbreeding coefficient in the 
eight phenotype case. 

We end our discussion of the Bayesian implementation of our imputation estimation by 
pointing out that this infere ntial framework i s already being used in evo lutionary biology, 



albeit somewhat informally 



Zhai et al 



2007 



Minin and Suchard 



2008bJ. These methods 



extend the idea of imputation evolutionary distance estimation to multiple sequences. 
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Figure 4: Genotype frequency estimation. We plot box plots of relative errors of plug-in, 
imputation, and Rao-Blackwellized imputation estimates of genotype frequencies (hbc an d 
/ibd) for two incomplete data mappings, with 9 and 8 observed phenotypes. Each trio of 
white, dark grey, and light grey box plots corresponds to an inbreeding coefficient that ranges 
from to 0.5. 
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7 Discussion 



We generalize the notion of imputation estimators and demonstrate that such estimators 
can be useful in a variety of incomplete data problems under model misspecification. We use 
simulations as our main tool in the first two examples and provide some simple asymptotic 
results in our last example. So far, our experience suggests that imputation estimators 
perform very well under mild model misspecification and when the fraction of missing data is 
reasonably small. Intuitively, it is clear that imputation estimators should be more successful 
as the amount of missing data decreases, because in the absence of missing information 
these estimators turn into sample means, which are model-free and consistent estimates of 
appropriate population-level quantities. However, to make this intuition useful, we need to 
connect formally efficiency of imputation estimators with the amount of missing data and 
degree of model misspecification. We hope to be able to make these connections in our future 
work. 

Studying sampling properties of imputation estimators proved to be difficult in general, 
especially since in practice the true sampling density of the observed data is unknown. In 
fact, in all our examples, we do not discuss how to compute the variance of the maximum 
likelihood-based imputation estimators. We recommend to use nonparametric bootstrap 
to explore sampling properties of imputation estimators. However, one should interpret 
bootstrap results with care, because imputation estimators remain biased even asymptoti- 
cally. Similar care needs to be applied to the interpretation of predictive distributions in the 
Bayesian context. 

Although we have not emphasized this throughout the paper, imputation estimators are 
usually easy to compute, which makes them particularly useful when a compromise between 
model complexity and computational efficiency results in an intentionally misspecified model. 
In our examples of model misspecification, we considered Gaussian mixture components, 
Hardy- Weinberg genotype frequencies, and parametric Markov models of DNA mutation. 
All these highly popular models owe a large portion of their success to their computational 
tractability. We argue that imputation estimators can take these and many other simple and 
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computationally efficient models one step further outside of their usual domain of application. 
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Appendix 



Proof of Theorem\J\ Defining m k i = X^Li l{yu=k,y 2i =i}] the misspecified complete-data log- 
likelihood takes the following form: 



*(y> &f) = ^ ^ lnp H (0 F , 1) 



(a-i; 



fees ZeS 

where pki{0 F , 1) is the probability of Xi = / conditional on starting Xq = A;. Recall that 



{pij(0 F , 1)}. Differentiating ( lA-lj) with respect model parameters, we 



arrive at the likelihood equations 



EE 



m k i dp kl (0 F , 1) 
p H (0 F ,l) 06 Fj 



0,j = l,...,r. 



(A-2) 



From backward Kolmogorov equation ^^P^- = A(#i?)P(#i?, t) with initial condition P(6 F , 0) 
I, we derive the following integral expression for the partial derivatives of transition proba- 
bilities: 

d_ 

Fj J uu Fj 



de 



P(0 F ,1) = / e Me ^JLA(e F )e A ^ l -T)dT. 
J o0 Fj 



Next, we write the imputation estimator in terms of rriki, 



where 



fees zes 



i 



E* (iV£l { x 1=i} |X = fc) 



E^ F (N c l {Xl=l} \X = k) = I / e A ^)A £ (^)e A( ^ )(1 - T) dr 



(A-3) 



Ball and Milne . 



2005) or 



Deriva tion of the formula (IA-31) can be found in 

2008al |. Condition ({TO]) says that there exist real constants c±, . . . ,c r such that 

dA{d F 



Minin and Suchardl . 



A C (6 F ) - I x Tr T (d F )A c (d F )l = J2 Ci- 



i=l 



d9, 



Of— Oi 
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Therefore, the difference between the plug-in and imputation estimators becomes 
1 n 

- % (N c | X = Vli , X, = y 2i ) - E dp (N c ) = 
i=i 

-EE T ,A I e A( ^[Aa^)-Ix7r^ F )A £ (^)l] e A(M(1 - T) dr 



fees leS 



p k i(0 F , 1) 



ki 



n 



i=i fees /eS 



Pm(O f , 1) 



,A(6f) 



0A(0j 



e A(d F ) (1 -r) dr 



F =0, 



kl 



-E c *EE 



^ (e F , i) 



TP — f 



n " Sis lis *) ^ 

because F satisfies likelihood equations (IA-2j) 



□ 



Proof of Theorem® As before, let = Y^=i l{yi»=fc,2/2i=0 - Using these site counts, define 



m Cl = E mfci ' m A = E mfc '' m ° = E mH ' 
(k,i)eCi {k,t)eC 2 k=i 



■> J£ 2 — — — > Id 



n 



n 



n 



where C 2 is defined by equation (fl2l) . Transition probabilities of the Kimura two-parameter 
model are obtained as 



I + l e -4/« _ l e -2(«+«t if ( jfcj I) e Cl 



p k i(a,(3,t) = < 



4 4 C 



if (A;,/) G £ 2 , 



(A-4) 



1 + I e -4/3t + i e -2(a+pyt [f k—l. 



Since the stationary distribution of the Kimura two-parameter model is uniform, (m^, m£ 2 ,mfl) 
Multinomial {pcuPChPd), where 



I —I— \ P -^ _ l p -2(a+/3) 

4 + 4 2 



P£2 = E jPki{a,P,l) = \ ~ ^e" 4/3 , p D = ^ip fcZ (a,/3,l) = ^ + ^e- 4/3 + i 



-2(a+/3) 



(fc,0ec a 
Therefore, 



k=i 



/il.S. ,. il.S. -> ,. el.S. 

£1 ->■ PA) ICa ->■ P-C 2 > and /£> 



(A-5) 
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by the strong law of large numbers. We will need these a.s. limits when we express both 
plug- in and imputation estimators in terms of /c i; fc 2 i an d fo- 

The mle of 7,7 = — | In [l — |(1 — fo)] , exists only if 1 — fjj < 3/4. Since we know that 



a - § - 3 _ 1-4/3 _ l p -2(a+j8) ^ 3 
4 4 C 2 



< |, we can safely assume that 7 is well defined for large 



enough n. The plug-in estimator 



E^(iY £l )=7 a 4 i ln 



1 _ 1 (1 _ I e -4/3 _ I e -2(a+/3) 

3 4 4 2 



a 1 /l + 2e 2 ^) 

p In 

f 4 1 3 



To derive the limit of the imputation estimator we start with 

n , 

- J> 7 ( N d I X ° = V* X i = V*) = E E ^TV E ^ (^Wo I *o = A:) . (A-6) 



Setting a = (5 in (1A-4j) . we obtain transition probabilities for the Jukes-Cantor model: 



Pki(l,t) 



{k=i}- 



(A-7) 



To get the functional form E 7 (A^c 1 l{Xi=z} | Xq = k), we first notice that A JC and Ag com- 
mute, leading to 

1 1 

> J - c ( T )e AJ °« 



| e AJC ^A^( 7 )e A ' TCW (i-) dr = Ag(7) e AJC W | dr = Ag 



Hence, 



E^(A £l l { x 1= o|Ao = A;) = 7 Q + ^e- 4 ^ 1 {(M)6A} + 7 Q - \e~^j 1 {W¥ A}- (A-8) 
Plugging (1A-7P and (1A-8j) to (1A-6|) . we arrive at 



1 n 

-E E i (N Cl \X = y ll ,X 1 
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1 - e- 4 ^ 1 + 3e" 4 ^ 
3/a x 



Plugging in limits (1A-5j) in the above formula produces the desired result. 



□ 



Proof of CorollaryUl Defining A = (e 4/3 + 2 e 2 ( Q+/3 ))/3, we write the limiting difference of 
the imputation and plug-in estimates as 

A In A 



im _ pi 
f'oo A* 00 



3(1 -A) 
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-4/3, 



1-e 



-2(o-/3)n 



(A-9) 



Since < A < 1, /i^ 1 — and a — (3 always have the same sign. Moreover, using e x > 1 + x, 
we can show that 

A\nA \n(l/A) /A , 

o < -T^a = W=T < < A - 10 > 

when a ^ (3. Recall that /i = 7r T A X2P (o;, = a. leading to 
//£-// = /3 - - In - a = — In 



4 V 3 / 4 V 3 

Hence, — \i and a — /3 always have opposite signs. 

Case 1: a > (3. We have < e" 4/3 (l - e _2(a_/3) ) < 2(a - which together with lA^TOl 
imply < /i^ 1 — < |(a — (3). Next, we use concavity of logarithm and arrive at 
fj,^ — ft < (a — /3)/3. Combining these last two inequalities, we have /i™ — fi < 0. Therefore, 

which proves the desired inequality. 

Case 2: a < (3. Recall that fj£ > fi. Plugging in > e" 4/3 (l - e -2 ^-^) = e~ 2{ - a+ ^ (e 2 ^) 

e 2(a-p) _ 1 tQ f[7^9|) we arriye at I ( e 2(a-/3) _ l) < ^ _ ^ < 0. So 

/C-M = /C-MS + /C-M> 3 (e 2 ^-l)- i ln ^ ± J. 

Defining the function on the right-hand side of the above inequality as w(a — (3), we show 
that w(0) = and 

1 j 3 e M + 2 3(e 2i + 2) 

for 5 < 0. Therefore, we have fO'™ — fi > w(a — (3) > and 

- /i = - + /c - m > /c - m > o . 

□ 
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